This page last changed on Oct 11, 2012 by tarragon.

This page describes how to install and configure a SAM-Nagios node from scratch.

IMPORTANT
These are the installation instructions for the latest SAM release (SAM-Update-17.1).
You can find information about previous releases in the Installing SAM-Nagios page.

Environment

Disable SELinux in /etc/selinux/config:

SELINUX=disabled
You must reboot the machine if you change this variable.
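
A minimal sketch of how this change could be scripted, assuming the file already contains a SELINUX= line. The function name disable_selinux is hypothetical and not part of SAM; it only wraps a sed substitution.

```shell
# Hypothetical helper: set SELINUX=disabled in the given config file.
# On a real node the file is /etc/selinux/config; a reboot is still needed.
disable_selinux() {
    conf="$1"
    # Rewrite the existing SELINUX=<mode> line; SELINUXTYPE= is left untouched.
    sed -i 's/^SELINUX=.*/SELINUX=disabled/' "$conf"
    # Show the resulting setting.
    grep '^SELINUX=' "$conf"
}

# Example (as root):
#   disable_selinux /etc/selinux/config
#   getenforce    # keeps reporting the old mode until the machine is rebooted
```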

Requirements

You need to install a host certificate in order to secure the Nagios web portal. The certificate should be placed in the standard location:

ls -l /etc/grid-security/host*
-rw-r--r-- 1 root root 2286 Oct 28 19:26 /etc/grid-security/hostcert.pem
-r-------- 1 root root  887 Oct 28 19:25 /etc/grid-security/hostkey.pem
The /etc/grid-security directory must have 755 permissions, and the certificate must have the SSL client attribute:
openssl x509 -in /etc/grid-security/hostcert.pem -noout -purpose | grep "SSL client"
SSL client : Yes
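
The requirements above could be verified with a small pre-flight script. This is a sketch only: the function name check_hostcert is hypothetical, and it assumes GNU stat (as shipped with SL5/CentOS 5). The SSL client purpose still has to be checked with the openssl command shown above.

```shell
# Hypothetical pre-flight check for the host certificate layout.
# Returns 0 when the directory mode and both files look as expected.
check_hostcert() {
    dir="${1:-/etc/grid-security}"
    rc=0
    # The directory is expected to have mode 755.
    mode=$(stat -c '%a' "$dir") || return 1
    if [ "$mode" != "755" ]; then
        echo "WARN: $dir has mode $mode, expected 755"
        rc=1
    fi
    # Both the certificate and the key must be present.
    for f in hostcert.pem hostkey.pem; do
        if [ ! -f "$dir/$f" ]; then
            echo "WARN: missing $dir/$f"
            rc=1
        fi
    done
    return $rc
}

# Example:
#   check_hostcert /etc/grid-security && echo "host certificate layout OK"
```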

Repositories

Install YUM and rpmforge packages:

Remove the old lcg-CA repository, if installed:

  • rm -f /etc/yum.repos.d/lcg-CA.repo

Repositories List

Configure the following repositories:

  • EGI CAs (egi-trustanchors.repo)
  • gLite BDII (glite-BDII.repo)
  • gLite UI (glite-UI.repo)
  • EPEL (epel.repo)
    [epel]
    name=Extra Packages for Enterprise Linux add-ons, no formal support from CERN
    baseurl=http://linuxsoft.cern.ch/epel/5/$basearch
    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL
    gpgcheck=1
    enabled=0
    protect=0
  • DAG (dag.repo)
    [dag]
    name=DAG (http://dag.wieers.com) add-on packages, no formal support from CERN
    baseurl=http://linuxsoft.cern.ch/dag/redhat/el5/en/$basearch/dag
    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-dag
    gpgcheck=1
    enabled=0
    protect=0
  • SAM (rpm with repo files)
    EGI sites should use this repository instead
    [egi-sam]
    name=EGI SAM repo
    baseurl=http://repository.egi.eu/sw/production/sam/1/$basearch
    enabled=1
    gpgcheck=0
    protect=1
    priority=10

Repository Priorities

Install yum-priorities:

yum install yum-priorities

Modify repository files:

The glite-UI repository may be given a higher priority number (that is, lower precedence under yum-priorities) than rpmforge.
  • rpmforge.repo
    ### Name: RPMforge RPM Repository for RHEL 5 - dag
    ### URL: http://rpmforge.net/
    [rpmforge]
    name = RHEL $releasever - RPMforge.net - dag
    baseurl = http://apt.sw.be/redhat/el5/en/$basearch/rpmforge
    mirrorlist = http://apt.sw.be/redhat/el5/en/mirrors-rpmforge
    #mirrorlist = file:///etc/yum.repos.d/mirrors-rpmforge
    enabled = 1
    protect = 0
    gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-rpmforge-dag
    gpgcheck = 1
    priority=11
    exclude=libyaml,python-django
    
    [rpmforge-extras]
    name = RHEL $releasever - RPMforge.net - extras
    baseurl = http://apt.sw.be/redhat/el5/en/$basearch/extras
    mirrorlist = http://apt.sw.be/redhat/el5/en/mirrors-rpmforge-extras
    #mirrorlist = file:///etc/yum.repos.d/mirrors-rpmforge-extras
    enabled = 1
    protect = 0
    gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-rpmforge-dag
    gpgcheck = 1
    priority=11
    
    [rpmforge-testing]
    name = RHEL $releasever - RPMforge.net - testing
    baseurl = http://apt.sw.be/redhat/el5/en/$basearch/testing
    mirrorlist = http://apt.sw.be/redhat/el5/en/mirrors-rpmforge-testing
    #mirrorlist = file:///etc/yum.repos.d/mirrors-rpmforge-testing
    enabled = 0
    protect = 0
    gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-rpmforge-dag
    gpgcheck = 1
  • glite-UI.repo
    [glite-UI]
    name=gLite 3.2 User Interface
    baseurl=http://glitesoft.cern.ch/EGEE/gLite/R3.2/glite-UI/sl5/x86_64/RPMS.release/
    gpgkey=http://glite.web.cern.ch/glite/glite_key_gd.asc
    gpgcheck=1
    enabled=1
    priority=16
    
    [glite-UI_updates]
    name=gLite 3.2 User Interface
    baseurl=http://glitesoft.cern.ch/EGEE/gLite/R3.2/glite-UI/sl5/x86_64/RPMS.updates/
    gpgkey=http://glite.web.cern.ch/glite/glite_key_gd.asc
    gpgcheck=1
    enabled=1
    priority=16
    
    [glite-UI_ext]
    name=gLite 3.2 User Interface
    baseurl=http://glitesoft.cern.ch/EGEE/gLite/R3.2/glite-UI/sl5/x86_64/RPMS.externals/
    gpgcheck=0
    enabled=1
    priority=16
EGI sites should use the EGI SAM repository described above.
  • sa1-centos5-release.repo
    [egee-sa1]
    name=EGEE Packages from SA1 for CentOS5
    baseurl=http://www.sysadmin.hep.ac.uk/rpms/egee-SA1/centos5/$basearch
    enabled=1
    priority=10
    gpgcheck=0
  • slc5-extras.repo
    [slc5-extras]
    name=Scientific Linux CERN 5 (SLC5) add-on packages, no formal support
    baseurl=http://linuxsoft.cern.ch/cern/slc5X/$basearch/yum/extras/
    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-cern
           file:///etc/pki/rpm-gpg/RPM-GPG-KEY-jpolok
    gpgcheck=1
    enabled=1
    protect=1
    priority=1
  • slc5-os.repo
    [slc5-os]
    name=Scientific Linux CERN 5 (SLC5) base system packages
    baseurl=http://linuxsoft.cern.ch/cern/slc5X/$basearch/yum/os/
    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-cern
           file:///etc/pki/rpm-gpg/RPM-GPG-KEY-jpolok
           file:///etc/pki/rpm-gpg/RPM-GPG-KEY-csieh
           file:///etc/pki/rpm-gpg/RPM-GPG-KEY-dawson
    gpgcheck=1
    enabled=1
    protect=1
    priority=1
    exclude=php*,perl-DBI,MySQL-python,c-ares,perl-DBD-MySQL,perl-SOAP-Lite
  • slc5-updates.repo
    [slc5-updates]
    name=Scientific Linux CERN 5 (SLC5) bugfix and security updates
    baseurl=http://linuxsoft.cern.ch/cern/slc5X/$basearch/yum/updates/
    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-cern
           file:///etc/pki/rpm-gpg/RPM-GPG-KEY-jpolok
           file:///etc/pki/rpm-gpg/RPM-GPG-KEY-csieh
           file:///etc/pki/rpm-gpg/RPM-GPG-KEY-dawson
    gpgcheck=1
    enabled=1
    protect=1
    priority=1
    exclude=php*,perl-DBI,MySQL-python,c-ares,perl-DBD-MySQL,perl-SOAP-Lite
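
To review the resulting precedence at a glance, the priorities can be extracted from the .repo files. The sketch below is an illustration, not part of SAM: list_repo_priorities is a hypothetical name, and sections without a priority line are shown with 99, the yum-priorities default. A lower number means the repository is preferred.

```shell
# Print "<priority> <repo-id>" for every section in the given .repo files,
# sorted so that the repository preferred by yum-priorities comes first.
list_repo_priorities() {
    awk '
        function emit() { if (id != "") print prio, id }
        # A new [section] header: report the previous one and reset.
        /^\[.*\]/ { emit(); id = substr($0, 2, index($0, "]") - 2); prio = 99; next }
        # priority=N or priority = N inside the current section.
        /^[ \t]*priority[ \t]*=/ { sub(/^[^=]*=[ \t]*/, ""); prio = $0 + 0 }
        END { emit() }
    ' "$@" | sort -n
}

# Example:
#   list_repo_priorities /etc/yum.repos.d/*.repo
```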

Installation

yum install lcg-CA
yum install httpd
yum install nagios.x86_64
# make sure that nagios from EGI SAM repository is installed
yum --exclude=\*saga\* --exclude=\*SAGA\* groupinstall 'glite-UI (production - x86_64)'
yum install sam-nagios
ARC probes. If you plan to use ARC probes to monitor ARC-CE services, the following additional steps are required.
UNICORE probes. If you plan to use UNICORE probes to monitor UNICORE services, the following additional steps are required. UNICORE probes require Java, so make sure that it is properly installed.

Configuration

A SAM-Nagios node can be configured in three different ways in order to monitor different sets of sites/services:

  • NGI SAM-Nagios : to monitor all sites/services belonging to a given region
  • Site SAM-Nagios : to monitor all sites/services made available by one site
  • VO SAM-Nagios : to monitor all sites/services that support a given VO (list of services can be based on VO feed or not)

The configuration of all SAM-Nagios boxes is based on YAIM (https://twiki.cern.ch/twiki/bin/view/EGEE/YAIM).

NGI Nagios

For NGI Nagios instances the following variables must be set (OPS VO example):

Edit YAIM configuration file:

# Generic
SITE_NAME=egee.srce.hr
SITE_BDII_HOST=ce1-egee.srce.hr
PX_HOST=se1-egee.srce.hr
BDII_HOST=bdii-egee.srce.hr
VOS="dteam ops"
VO_OPS_VOMS_SERVERS="vomss://voms.cern.ch:8443/voms/ops?/ops/"
VO_OPS_VOMSES="'ops lcg-voms.cern.ch 15009 /DC=ch/DC=cern/OU=computers/CN=lcg-voms.cern.ch ops 24' 'ops voms.cern.ch 15004 /DC=ch/DC=cern/OU=computers/CN=voms.cern.ch ops 24'"
VO_OPS_VOMS_CA_DN="'/DC=ch/DC=cern/CN=CERN Trusted Certification Authority' '/DC=ch/DC=cern/CN=CERN Trusted Certification Authority'"

VO_DTEAM_VOMS_SERVERS='vomss://voms.hellasgrid.gr:8443/voms/dteam?/dteam/'
VO_DTEAM_VOMSES="'dteam voms.hellasgrid.gr 15004 /C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms.hellasgrid.gr dteam 24' 'dteam voms2.hellasgrid.gr 15004 /C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms2.hellasgrid.gr dteam 24'"
VO_DTEAM_VOMS_CA_DN="'/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006' '/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006'"

RB_HOST=skurut2.cesnet.cz # irrelevant, RB is unsupported
VO_DTEAM_WMS_HOSTS="wms204.cern.ch wms205.cern.ch" # set to your NGI WMSes
VO_OPS_WMS_HOSTS="wms204.cern.ch wms205.cern.ch" # set to your NGI WMSes

# Nagios
NAGIOS_HOST=nagiosdev001.cern.ch
NAGIOS_ADMIN_DNS="/C=HR/O=edu/OU=srce/CN=Emir Imamagic"
NCG_NAGIOS_ADMIN=eimamagi@srce.hr
NAGIOS_ROLE=ngi
NCG_PROBES_TYPE=local
NCG_VO=ops

NAGIOS_HTTPD_ENABLE_CONFIG=true
NAGIOS_SUDO_ENABLE_CONFIG=true
NAGIOS_NCG_ENABLE_CONFIG=true
NAGIOS_NAGIOS_ENABLE_CONFIG=true
NAGIOS_CGI_ENABLE_CONFIG=true
NAGIOS_NSCA_PASS=MY_PASS

# NGI/ROC Nagios
COUNTRY_NAME=Croatia
NCG_GOCDB_ROC_NAME=NGI_HR
NAGIOS_SUDO_ENABLE_CONFIG=true

# DB data
MYSQL_ADMIN="MY_MYSQL_PASS"
DB_PASS="MY_MRS_PASS"

MYEGI_ADMIN_NAME="Admin Name"
MYEGI_ADMIN_EMAIL="admin@address.hr"
MYEGI_DEFAULT_PROFILE="ROC" # profile to be displayed by default in MyEGI
MYEGI_REGION="NGI_HR"
If you want to speed up the process above, remove the gcc package from the system.

Run YAIM:

/opt/glite/yaim/bin/yaim -s site-info.def -c -n glite-UI -n glite-NAGIOS
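
Because site-info.def is a plain shell file, a quick sanity check before invoking YAIM can catch unset variables. The following is a sketch under that assumption; check_site_info is a hypothetical helper, and the list of variables to check is whatever the chosen role requires.

```shell
# Hypothetical pre-flight check: source a YAIM config file in a subshell and
# report which of the named variables are unset or empty.
check_site_info() {
    conf="$1"; shift
    (
        . "$conf" || exit 2
        missing=0
        for var in "$@"; do
            # Indirect lookup of the variable named in $var.
            eval "val=\${$var:-}"
            if [ -z "$val" ]; then
                echo "missing: $var"
                missing=1
            fi
        done
        exit $missing
    )
}

# Example, using variable names from the NGI configuration:
#   check_site_info ./site-info.def SITE_NAME NAGIOS_HOST NAGIOS_ROLE NCG_VO NAGIOS_NSCA_PASS
```

The subshell keeps the sourced variables out of the calling shell, so the check can be repeated safely after each edit.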

Run myproxy-init:

Run this only on a UI box, using your dteam credentials:
myproxy-init -l nagios -s se1-egee.srce.hr -k NagiosRetrieve-nagiosdev001.cern.ch-dteam -c 336 -x -Z "/DC=ch/DC=cern/OU=computers/CN=nagiosdev001.cern.ch"

Site Nagios

For Site Nagios instances the following variables must be set (using remote-only probes):

Edit YAIM configuration file:

# Generic
SITE_NAME=egee.srce.hr
SITE_BDII_HOST=ce1-egee.srce.hr
PX_HOST=se1-egee.srce.hr
BDII_HOST=bdii-egee.srce.hr
VOS="dteam ops"
VO_DTEAM_VOMS_SERVERS="'vomss://voms.hellasgrid.gr:8443/voms/dteam?/dteam/'"
VO_OPS_VOMS_SERVERS="vomss://voms.cern.ch:8443/voms/ops?/ops/"

# Nagios
NAGIOS_HOST=nagiosdev001.cern.ch
NAGIOS_ADMIN_DNS="/C=HR/O=edu/OU=srce/CN=Emir Imamagic"
NCG_NAGIOS_ADMIN=eimamagi@srce.hr
NAGIOS_ROLE=site
NCG_PROBES_TYPE=remote
NCG_VO=dteam
NAGIOS_HTTPD_ENABLE_CONFIG=true
NAGIOS_SUDO_ENABLE_CONFIG=true
NAGIOS_NCG_ENABLE_CONFIG=true
NAGIOS_NAGIOS_ENABLE_CONFIG=true
NAGIOS_CGI_ENABLE_CONFIG=true
NCG_REMOTE_USE_NAGIOS=true
NAGIOS_NSCA_PASS=MY_PASS

Run YAIM:

/opt/glite/yaim/bin/yaim -s site-info.def -c -n glite-NAGIOS


For Site Nagios instances the following variables must be set (using all probes):

Edit YAIM configuration file:

# Generic
SITE_NAME=egee.srce.hr
SITE_BDII_HOST=ce1-egee.srce.hr
PX_HOST=se1-egee.srce.hr
BDII_HOST=bdii-egee.srce.hr
VOS="dteam ops"
VO_OPS_VOMS_SERVERS="vomss://voms.cern.ch:8443/voms/ops?/ops/"
VO_OPS_VOMSES="'ops lcg-voms.cern.ch 15009 /DC=ch/DC=cern/OU=computers/CN=lcg-voms.cern.ch ops 24' 'ops voms.cern.ch 15004 /DC=ch/DC=cern/OU=computers/CN=voms.cern.ch ops 24'"
VO_OPS_VOMS_CA_DN="'/DC=ch/DC=cern/CN=CERN Trusted Certification Authority' '/DC=ch/DC=cern/CN=CERN Trusted Certification Authority'"

VO_DTEAM_VOMS_SERVERS='vomss://voms.hellasgrid.gr:8443/voms/dteam?/dteam/'
VO_DTEAM_VOMSES="'dteam voms.hellasgrid.gr 15004 /C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms.hellasgrid.gr dteam 24' 'dteam voms2.hellasgrid.gr 15004 /C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms2.hellasgrid.gr dteam 24'"
VO_DTEAM_VOMS_CA_DN="'/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006' '/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006'"

RB_HOST=skurut2.cesnet.cz
VO_DTEAM_WMS_HOSTS="wms204.cern.ch wms205.cern.ch"

# Nagios
NAGIOS_HOST=nagiosdev001.cern.ch
NAGIOS_ADMIN_DNS="/C=HR/O=edu/OU=srce/CN=Emir Imamagic"
NCG_NAGIOS_ADMIN=eimamagi@srce.hr
NAGIOS_ROLE=site
NCG_PROBES_TYPE=remote,local
NCG_VO=dteam
NAGIOS_HTTPD_ENABLE_CONFIG=true
NAGIOS_SUDO_ENABLE_CONFIG=true
NAGIOS_NCG_ENABLE_CONFIG=true
NAGIOS_NAGIOS_ENABLE_CONFIG=true
NAGIOS_CGI_ENABLE_CONFIG=true
NCG_REMOTE_USE_NAGIOS=true
NAGIOS_NSCA_PASS=MY_PASS
If you want to speed up the process above, remove the gcc package from the system.

Run YAIM:

/opt/glite/yaim/bin/yaim -s site-info.def -c -n glite-UI -n glite-NAGIOS

Run myproxy-init:

Run this only on a UI box, using your dteam credentials:
myproxy-init -l nagios -s se1-egee.srce.hr -k NagiosRetrieve-nagiosdev001.cern.ch-dteam -c 336 -x -Z "/DC=ch/DC=cern/OU=computers/CN=nagiosdev001.cern.ch"

VO Nagios

For VO Nagios instances the following variables must be set (CMS VO example using VO feeds).

Separate installation notes for EGI VOs were kindly provided by Gonçalo Borges (NGI_IBERGRID) and are available here.
In order to add a new 'myfeed.org' VO feed, the following YAIM variables should be set:
ATP_VO_FEEDS="myfeed_org"
ATP_VO_FEED_MYFEED_ORG="http://myserver.org/vo-feed.xml"

These lines will generate:

# YAIM generated vo_feeds.conf file.
# Date: Wed Sep 5 14:36:00 CEST 2012
#
[vo_feeds]
# feed definition: - <vo_name>_topology = url
myfeed.org_topology = http://myserver.org/vo-feed.xml
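
The variable name in the example above follows from the feed name by upper-casing it and replacing dots with underscores. A minimal sketch of that mapping is shown below; vo_feed_var is a hypothetical helper, and treating hyphens like dots is an assumption based on the usual YAIM VO notation rather than something stated on this page.

```shell
# Hypothetical helper: derive the YAIM variable name for a VO feed,
# following the myfeed.org -> ATP_VO_FEED_MYFEED_ORG pattern.
vo_feed_var() {
    # Replace dots (and, by assumption, hyphens) with underscores,
    # then upper-case the result.
    suffix=$(printf '%s' "$1" | tr '.-' '__' | tr '[:lower:]' '[:upper:]')
    echo "ATP_VO_FEED_${suffix}"
}

# Example:
#   vo_feed_var myfeed.org
```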

Edit YAIM configuration file:

# Generic
SITE_NAME=egee.srce.hr
SITE_BDII_HOST=ce1-egee.srce.hr
PX_HOST=se1-egee.srce.hr
BDII_HOST=bdii-egee.srce.hr
RB_HOST=skurut2.cesnet.cz # irrelevant, RB is unsupported

VOS="dteam ops cms"
VO_OPS_VOMS_SERVERS="vomss://voms.cern.ch:8443/voms/ops?/ops/"
VO_CMS_VOMS_SERVERS="'vomss://voms.cern.ch:8443/voms/cms?/cms/'"
VO_OPS_VOMSES="'ops lcg-voms.cern.ch 15009 /DC=ch/DC=cern/OU=computers/CN=lcg-voms.cern.ch ops 24' 'ops voms.cern.ch 15004 /DC=ch/DC=cern/OU=computers/CN=voms.cern.ch ops 24'"
VO_CMS_VOMSES="'cms lcg-voms.cern.ch 15002 /DC=ch/DC=cern/OU=computers/CN=lcg-voms.cern.ch cms 24' 'cms voms.cern.ch 15002 /DC=ch/DC=cern/OU=computers/CN=voms.cern.ch cms 24'"
VO_OPS_VOMS_CA_DN="'/DC=ch/DC=cern/CN=CERN Trusted Certification Authority' '/DC=ch/DC=cern/CN=CERN Trusted Certification Authority'"
VO_CMS_VOMS_CA_DN="'/DC=ch/DC=cern/CN=CERN Trusted Certification Authority' '/DC=ch/DC=cern/CN=CERN Trusted Certification Authority'"

VO_DTEAM_VOMS_SERVERS='vomss://voms.hellasgrid.gr:8443/voms/dteam?/dteam/'
VO_DTEAM_VOMSES="'dteam voms.hellasgrid.gr 15004 /C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms.hellasgrid.gr dteam 24' 'dteam voms2.hellasgrid.gr 15004 /C=GR/O=HellasGrid/OU=hellasgrid.gr/CN=voms2.hellasgrid.gr dteam 24'"
VO_DTEAM_VOMS_CA_DN="'/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006' '/C=GR/O=HellasGrid/OU=Certification Authorities/CN=HellasGrid CA 2006'"

VO_DTEAM_WMS_HOSTS="wms204.cern.ch wms205.cern.ch" # set to your NGI WMSes
VO_OPS_WMS_HOSTS="wms204.cern.ch wms205.cern.ch" # set to your NGI WMSes
VO_CMS_WMS_HOSTS="wms204.cern.ch wms205.cern.ch" # set to your NGI WMSes

# Nagios
NAGIOS_HOST=nagiosdev001.cern.ch
NAGIOS_ADMIN_DNS="/C=HR/O=edu/OU=srce/CN=Emir Imamagic"
NCG_NAGIOS_ADMIN=eimamagi@srce.hr
NAGIOS_ROLE=vo
NCG_PROBES_TYPE=local
NCG_VO=cms
NAGIOS_HTTPD_ENABLE_CONFIG=true
NAGIOS_SUDO_ENABLE_CONFIG=true
NAGIOS_NCG_ENABLE_CONFIG=true
NAGIOS_NAGIOS_ENABLE_CONFIG=true
NAGIOS_CGI_ENABLE_CONFIG=true
NAGIOS_NSCA_PASS=MY_PASS

# VO Nagios
NCG_TOPOLOGY_USE_GOCDB=false
NCG_TOPOLOGY_USE_LDAP=false
NCG_REMOTE_USE_NAGIOS=false

NCG_USE_ATP_VO_FEED=true
ATP_ROOT_URL="https://localhost/atp"

# ATP
ATP_VO_FEEDS="<list of VOs>"
ATP_VO_FEED_<vo1>="<vo feed url>"
ATP_VO_FEED_<vo2>="<vo feed url>"

# POEM
POEM_WEB_ENABLE=True
POEM_NAMESPACE="<namespace>"
POEM_SYNC_URLS="http://localhost/poem/api/0.1/json/"
POEM_SYNC_NS_RESTRICT=""

# DB data
MYSQL_ADMIN="MY_MYSQL_PASS"
DB_PASS="MY_MRS_PASS"

MYEGI_ADMIN_NAME="Admin Name"
MYEGI_ADMIN_EMAIL="admin@address.hr"
MYEGI_DEFAULT_PROFILE=<profile> # name of the profile you have defined via POEM without namespace
MYEGI_REGION="NGI_HR"
If you want to speed up the process above, remove the gcc package from the system.

Run YAIM:

/opt/glite/yaim/bin/yaim -s site-info.def -c -n glite-UI -n glite-NAGIOS
You may have to run YAIM twice for Update-17, as the first run may not complete correctly. Afterwards, please follow the instructions in the POEM User's Guide for VO-Nagios, which will help you set up an initial profile.

Run myproxy-init:

Run this only on a UI box, using your dteam credentials:
myproxy-init -l nagios -s se1-egee.srce.hr -k NagiosRetrieve-nagiosdev001.cern.ch-dteam -c 336 -x -Z "/DC=ch/DC=cern/OU=computers/CN=nagiosdev001.cern.ch"

Additional Configuration

Migrating from Hash_local.pm to ncg-metric-config

If you have custom settings in Hash_local.pm, note that Hash_local.pm is no longer supported as of Update-17. After the upgrade, admins need to convert the metric configuration in Hash_local.pm to the new format. The conversion is done as follows:

/usr/libexec/hashlocal-to-json.pl > /etc/ncg-metric-config.d/local.conf

Note: the generated config will only contain metric descriptions. In order to include the metrics in the final configuration, one needs to configure a POEM profile with the appropriate service-metric mappings. See POEM configuration.

NCG does not support overriding metric attributes. If a metric definition exists in ncg-metric-config.conf, the definition in /etc/ncg-metric-config.d/*.conf will be ignored. In order to change metric parameters, use localdb.

Failover Nagios - configurable hot-standby mode

Starting from Update-13 SAM supports deployment of hot-standby (active/active) instances. The system works in the following way:

  • Backup instance is deployed by using the same Yaim configuration with added BACKUP_INSTANCE variable described below.
  • SAM administrator opens GGUS ticket to SAM/Nagios Support Unit requesting addition of the backup host to message consumer filter (file nagios-roles.conf).
  • Backup instance constantly monitors resources, but it has the following features:
    • alarms are not sent to Operations portal
    • email notifications are disabled
    • results are not sent to the central MRS database.
    • note: results are stored to local MRS so the MyEGI shows correct history on both instances.
  • In case of failure of the main instance SAM administrator has to manually switch off BACKUP_INSTANCE variable on the backup instance.

Backup instance can be defined in several ways:

  • Via Yaim variable which sets variable BACKUP_INSTANCE in /etc/sysconfig/ncg (recommended mechanism)
    NCG_BACKUP_INSTANCE=true
  • Fast backup configuration without YAIM execution:
    • Setting variable BACKUP_INSTANCE in /etc/sysconfig/ncg (this approach can be used for fast failover):
      BACKUP_INSTANCE=true
    • Setting global variable in ncg.conf file:
      BACKUP_INSTANCE=1
    • Using ncg.pl argument:
      ncg.pl --backup-instance

In case of backup configuration without Yaim the following additional step is needed:

/sbin/chkconfig send-to-dashboard off
/sbin/service send-to-dashboard stop
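
The fast backup configuration without YAIM could be scripted roughly as follows. This is a sketch only: set_backup_instance is a hypothetical helper that edits the sysconfig file passed to it, and it assumes the file holds plain VARIABLE=value lines.

```shell
# Hypothetical helper: set or clear BACKUP_INSTANCE in a sysconfig-style file.
# usage: set_backup_instance <file> on|off
set_backup_instance() {
    conf="$1"
    # Drop any existing BACKUP_INSTANCE line first.
    sed -i '/^BACKUP_INSTANCE=/d' "$conf"
    # Re-add it only when backup mode is requested.
    if [ "$2" = "on" ]; then
        echo 'BACKUP_INSTANCE=true' >> "$conf"
    fi
}

# Example: put a node into backup mode without running YAIM (as root):
#   set_backup_instance /etc/sysconfig/ncg on
#   /sbin/chkconfig send-to-dashboard off
#   /sbin/service send-to-dashboard stop
```

Calling the helper with off removes the variable again, leaving the other settings in the file untouched.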

In order to turn the backup instance into the active one, the SAM administrator has to remove the BACKUP_INSTANCE variable. If YAIM is not used, the following additional step is needed:

/sbin/chkconfig send-to-dashboard on
/sbin/service send-to-dashboard start

Robot certificates

Starting from Update-09 SAM supports the use of robot certificates instead of MyProxy credentials. If your CA supports robot certificates, we suggest switching to them, as they are easier to maintain. Robot certificates also provide better availability, as SAM doesn't depend on the availability of a MyProxy server.

In order to use robot certificates set the following YAIM variables:

NCG_USE_ROBOT_CERT=true
# Robot cert and key can be different for each VO
# and standard Yaim VO notation is used
VO_OPS_ROBOT_CERT=/etc/nagios/globus/robot-cert.pem
VO_OPS_ROBOT_KEY=/etc/nagios/globus/robot-key.pem
VO_DTEAM_ROBOT_CERT=/etc/nagios/globus/robot-cert.pem-dteam
VO_DTEAM_ROBOT_KEY=/etc/nagios/globus/robot-key.pem-dteam
The ROBOT_CERT and ROBOT_KEY variables use standard Yaim VO notation. If VO directories (vo.d/voname) are used, the variables should be put in the appropriate VO files.

ACE support in MyEGI

ACE support is currently available only for the central MyEGI instance. YAIM configuration:

MYEGI_ACE=true

Setting alternative SE for metric org.sam.WN-RepRep

Starting from the release Update-07, it is possible to specify more than one replication SE for WN replica test org.sam.WN-RepRep. Static and/or dynamic mechanisms are possible.

In order to define static list of comma-separated hostnames set the following Yaim variable:

JOBSUBMIT_WN_SE_REP=se1[,se2,se3...]

Dynamic list is filled with a list of SEs defined on the Nagios instance that recently successfully passed org.sam.SRM-All set of tests. In order to use dynamic list set the following Yaim variable:

JOBSUBMIT_WN_SE_REP_FILE=filename

Filename must be defined without path.
If the dynamic list is used, the metric hr.srce.GoodSEs is associated with the Nagios host. The hr.srce.GoodSEs metric generates the list of "good" SEs and provides the file as an input parameter to the org.sam.(CREAM)CE-JobState metric(s).

The org.sam.(CREAM)CE-JobState metric(s) take up to 3 hosts from the file and, if JOBSUBMIT_WN_SE_REP was defined, append them to the static list. On the WN, org.sam.WN-RepRep tries to replicate to all the SEs in the provided order until the replication succeeds. The metric returns CRITICAL if the file couldn't be replicated to any of the SEs. This fixes https://tomtools.cern.ch/jira/browse/SAM-442.
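
How the static and dynamic lists are combined can be illustrated with a short sketch. It is not the probe's actual code: build_se_list is a hypothetical function, and the one-hostname-per-line format of the dynamic file is an assumption.

```shell
# Hypothetical illustration of the SE list handed to org.sam.WN-RepRep:
# static hosts first, then up to three hosts from the dynamic file.
build_se_list() {
    static="$1"      # comma-separated, may be empty (JOBSUBMIT_WN_SE_REP)
    dynfile="$2"     # assumed format: one hostname per line
    # Take at most three hosts and join them with commas.
    dynamic=$(head -n 3 "$dynfile" 2>/dev/null | paste -s -d, -)
    if [ -n "$static" ] && [ -n "$dynamic" ]; then
        echo "$static,$dynamic"
    else
        echo "$static$dynamic"
    fi
}

# Example (the file name is a placeholder):
#   build_se_list "se1,se2" /path/to/good-ses-file
```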

Setting alternative BDII for metric org.sam.SRM-All

Metric org.sam.SRM-All uses sam-bdii.cern.ch top BDII by default. In order to make tests less dependent on CERN top BDII it is suggested to set alternative BDII.

In order to set alternative BDII create localdb file (e.g. /etc/ncg/ncg-localdb.d/srm.conf). There are two options:
1. switch to your own top BDII:

MODIFY_METRIC_PARAMETER!org.sam.SRM-All!--ldap-uri!your.top.bdii

2. use site BDII:

MODIFY_METRIC_ATTRIBUTE!org.sam.SRM-All!SITE_BDII!--ldap-uri

Setting alternative LFC for metrics org.sam.WN-Rep*

Metrics org.sam.WN-Rep* use prod-lfc-shared-central.cern.ch LFC by default.

In order to set an alternative LFC, create a localdb file (e.g. /etc/ncg/ncg-localdb.d/LFC.conf):

MODIFY_METRIC_PARAMETER!org.sam.CREAMCE-JobState!--wn-lfc!lfc.my.domain
MODIFY_METRIC_PARAMETER!org.sam.CE-JobState!--wn-lfc!lfc.my.domain

Setting alternative BDII for metric org.sam.CREAMCE-DirectJobState

Metric org.sam.CREAMCE-DirectJobState uses sam-bdii.cern.ch top BDII by default. In order to make tests less dependent on CERN top BDII it is suggested to set alternative BDII.

In order to set alternative BDII create localdb file (e.g. /etc/ncg/ncg-localdb.d/creamcedjs.conf). There are two options:
1. switch to your own top BDII:

MODIFY_METRIC_PARAMETER!org.sam.CREAMCE-DirectJobState!--ldap-uri!your.top.bdii

2. use site BDII:

MODIFY_METRIC_ATTRIBUTE!org.sam.CREAMCE-DirectJobState!SITE_BDII!--ldap-uri

Setting alternative list of CEs for metric org.sam.WMS-JobState

If the monitored infrastructure contains WMS service and no CE services, metric hr.srce.GoodCEs associated to Nagios service will fail with the following error:

HealthyNodes CRITICAL - No healthy hosts found.

There are two options to solve this issue.

1. If the infrastructure contains CREAM-CE services create file /etc/ncg/ncg-localdb.d/GoodCEs-fix with the following content:

MODIFY_METRIC_PARAMETER!hr.srce.GoodCEs!--metric!org.sam.CREAMCE-JobSubmit

2. In order to use static list of CEs or CREAM-CEs create file /etc/ncg/ncg-localdb.d/GoodCEs-fix with the following content:

REMOVE_METRIC!hr.srce.GoodCEs

For each VO supported on SAM instance create file /var/lib/gridprobes/<VO_NAME>/GoodCEs and list CE/CREAM-CE names in it. Example is:

ce1.reliable.my
ce2.reliable.my
cream-ce3.reliable.my

In case VO_FQAN is used (e.g. /ops/Role=lcgadmin) <VO_NAME> should be set to VO_FQAN with "/" replaced with "." (e.g. /var/lib/gridprobes/ops.Role=lcgadmin/GoodCEs).
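
The directory naming rule can be expressed as a small helper. This is a sketch; goodces_path is a hypothetical name, and dropping the leading "/" before the substitution is inferred from the example path above.

```shell
# Hypothetical helper: map a VO name or VO_FQAN to its GoodCEs file.
goodces_path() {
    # Drop the leading "/" and turn every remaining "/" into ".".
    name=$(printf '%s' "$1" | sed -e 's|^/||' -e 's|/|.|g')
    echo "/var/lib/gridprobes/$name/GoodCEs"
}

# Example:
#   goodces_path /ops/Role=lcgadmin
```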

Monitoring Globus services

Globus services currently do not support VOs. In order to monitor Globus services, the SAM administrator has to contact all sites and ask them to add the certificate DN to the grid-mapfile.

Removing metrics from alias only (SAM-1645)

REMOVE_ALIAS_METRIC!alias!metric

Enabling host and site contacts when global notifications (ENABLE_NOTIFICATIONS) are disabled

This is done in localdb:

ADD_SITE_CONTACT!sitename!emailAddress
ENABLE_SITE_CONTACT!sitename!emailAddress
ENABLE_HOSTCONTACT!hostname!emailAddress

Throttling of MyWLCG WEB API

Performance limits in the MyWLCG/MyEGI portal are set by YAIM variables.
These variables have the default values listed below:

# Limit number of rows that can be fetched at a time to avoid DB dumps.
MYWLCG_DB_LIMIT=50000
# Limit number of accesses per IP address in a given time(seconds).
MYWLCG_ACCESS_PERIOD=5
MYWLCG_NUMBER_OF_ACCESSES=100

Validation

After a successful YAIM run you should be able to access the Nagios web interface at https://NAGIOS_SERVER/nagios.

If you enabled local probes, first check that the MyProxy credential works by running the hr.srce.GridProxy-Get-VO metric on NAGIOS_SERVER. You can do this by forcing a scheduled check via the web interface or via the command line:

nagios-run-check NAGIOS_SERVER hr.srce.GridProxy-Get-VO

MyEGI interface is at the address: https://NAGIOS_SERVER/myegi.

Check resource BDII:

ldapsearch -x -LLL -h NAGIOS_SERVER -p 2170 -b Mds-Vo-Name=resource,O=grid "(GlueServiceType=*-NAGIOS)" GlueServiceEndpoint
dn: GlueServiceUniqueID=NAGIOS_SERVER_XXXXXX-NAGIOS_2937827985,Mds-Vo-name=
 resource,o=grid
GlueServiceEndpoint: https://NAGIOS_SERVER:443/nagios
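
As the sample output shows, ldapsearch folds long LDIF lines, with a leading space marking a continuation. For scripted validation, the output can be unfolded before extracting the endpoint. The helper below is a sketch and glue_endpoints is a hypothetical name.

```shell
# Hypothetical helper: unfold LDIF continuation lines on stdin and print
# the GlueServiceEndpoint values.
glue_endpoints() {
    awk '
        # A leading space means this line continues the previous one.
        /^ / { buf = buf substr($0, 2); next }
        { if (buf != "") print buf; buf = $0 }
        END { if (buf != "") print buf }
    ' | sed -n 's/^GlueServiceEndpoint: //p'
}

# Example:
#   ldapsearch -x -LLL -h NAGIOS_SERVER -p 2170 -b Mds-Vo-Name=resource,O=grid \
#       "(GlueServiceType=*-NAGIOS)" GlueServiceEndpoint | glue_endpoints
```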

Known Issues

For machines running latest version of glite-UI (3.2.10-1):
Please restart Nagios after yaim execution. Otherwise you may see problems similar to SAM-1693.

service nagios restart

Problems

A description of common problems when installing SAM can be found in the Troubleshooting section.

Document generated by Confluence on Feb 27, 2014 10:19